HADOOP-19497: [ABFS] Enable rename and create recovery from client transaction id over DFS endpoint #7509
Conversation
🎊 +1 overall
This message was automatically generated.
* @param op The ABFS operation result for the rename attempt.
* @param isMetadataIncompleteState Flag indicating if metadata is incomplete.
* @throws IOException If an I/O error occurs during the rename operation.
* @throws FileAlreadyExistsException If the destination file already exists and overwrite is unauthorized.
We only get the Unauthorized Blob Overwrite error when the destination file already exists. For invalid permissions, separate errors are thrown. We can remove the "unauthorized overwrite" part here.
Removed "and overwrite is unauthorized" from the message.
try {
incrementAbfsRenamePath();
op.execute(tracingContext);
// AbfsClientResult contains the AbfsOperation, If recovery happened or
grammar doesn't look correct
Correction done.
return clientTransactionId.equals(
abfsHttpOperation.getResponseHeader(X_MS_CLIENT_TRANSACTION_ID));
} catch (AzureBlobFileSystemException exception) {
throw new AbfsDriverException(ERR_RENAME_RECOVERY, exception);
add the path in the exception message
Taken!
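A minimal sketch of what the resolved check might look like, using hypothetical simplified types standing in for the real ABFS classes (`recoveredByClientTransactionId`, the plain `Map` of response headers, and the `IllegalStateException` wrapping are all assumptions for illustration):

```java
import java.util.Map;

// Sketch: compare the client transaction id we sent with the one the service
// echoes back on the destination; wrap failures with the path, per the review.
public class RecoveryCheckSketch {

    static final String X_MS_CLIENT_TRANSACTION_ID = "x-ms-client-transaction-id";
    static final String ERR_RENAME_RECOVERY = "Rename recovery failed for path: ";

    /** True when the destination's echoed transaction id matches the one sent. */
    static boolean recoveredByClientTransactionId(String sentId,
            Map<String, String> responseHeaders, String path) {
        try {
            // A missing header means the earlier rename never landed: no recovery.
            return sentId.equals(responseHeaders.get(X_MS_CLIENT_TRANSACTION_ID));
        } catch (RuntimeException e) {
            // Include the failing path in the message, as requested above.
            throw new IllegalStateException(ERR_RENAME_RECOVERY + path, e);
        }
    }
}
```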
@@ -275,9 +284,14 @@ public void testRenameRecoveryEtagMatchFsLevel() throws IOException {
// 4 calls should have happened in total for rename
// 1 -> original rename rest call, 2 -> first retry,
// +2 for getPathStatus calls
int totalConnections = 4;
if (!getConfiguration().getIsClientTransactionIdEnabled()) {
Why the ! here?
In case of recovery using client transaction id, it makes 4 calls in total:
1. getFileStatus (AzureBlobFileSystem)
2. rename (without retry)
3. rename (first retry)
4. getPathStatus (to get the client transaction id)

In case of recovery using etag, it makes 5 calls in total:
1. getFileStatus (AzureBlobFileSystem)
2. getPathStatus (to fetch the etag of the source)
3. rename (without retry)
4. rename (first retry)
5. getPathStatus (to fetch the etag of the destination)

Before this change, call 2 of the etag case (getPathStatus to fetch the source etag) was made even in the client-transaction-id case.
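The two call sequences above can be captured in a small helper; this is only a sketch, and the method name `expectedRenameConnections` is hypothetical:

```java
// Sketch: expected connection counts for the two rename-recovery modes.
public class RenameConnectionCount {

    /**
     * Client-transaction-id recovery: getFileStatus + rename + rename retry
     * + one getPathStatus (to read the echoed transaction id) = 4 calls.
     * Etag recovery adds an extra getPathStatus for the source etag = 5 calls.
     */
    static int expectedRenameConnections(boolean clientTransactionIdEnabled) {
        return clientTransactionIdEnabled ? 4 : 5;
    }

    public static void main(String[] args) {
        System.out.println(expectedRenameConnections(true));   // 4
        System.out.println(expectedRenameConnections(false));  // 5
    }
}
```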
// 1 -> original rename rest call, 2 -> first retry,
// +1 for getPathStatus calls
// last getPathStatus call should be skipped
assertThatStatisticCounter(ioStats,
CONNECTIONS_MADE.getStatName())
.isEqualTo(newConnections + connMadeBeforeRename);
.isEqualTo(4 + connMadeBeforeRename);
should not be hardcoded here
In the rename operation, we are adding 4 extra connections, which is why it’s kept like this. We are following the same format in other places as well.
Added a few comments
@@ -199,13 +199,10 @@ public String toString() {
}

public static ApiVersion getCurrentVersion() {
return DEC_12_2019;
return NOV_04_2024;
Are we upgrading version for all the APIs? Or just Rename-delete?
Yes, we are updating it for all the APIs. It was discussed offline to not keep separate versions for different APIs and instead go for an overall API version upgrade.
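A sketch of what a single service-wide version enum could look like. The constant names mirror the diff above; the date-string values follow Azure Storage's "yyyy-MM-dd" `x-ms-version` convention and are assumptions here, as is the class name:

```java
// Sketch: one API version for the whole driver rather than per-API versions,
// as discussed above. String values are assumed from the constant names.
public enum ApiVersionSketch {
    DEC_12_2019("2019-12-12"),
    NOV_04_2024("2024-11-04");

    private final String xMsVersion;

    ApiVersionSketch(String xMsVersion) {
        this.xMsVersion = xMsVersion;
    }

    @Override
    public String toString() {
        return xMsVersion;
    }

    /** Single overall version upgrade: every API uses the current version. */
    public static ApiVersionSketch getCurrentVersion() {
        return NOV_04_2024;
    }
}
```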
isMetadataIncompleteState);
} catch (AzureBlobFileSystemException e) {
// Handle rename exceptions and retry if applicable
handleRenameException(source, destination, continuation,
There is a rename call happening inside this as well, and another recovery happening below.
Can you explain what's the difference and add some comments around it?
Yes, this is the existing code: when we get the RENAME_DESTINATION_PARENT_PATH_NOT_FOUND error from the server, we retrigger the rename operation after fetching the source etag value.
Recovery happens when the server error is SOURCE_PATH_NOT_FOUND.
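The distinction between the two failure paths could be sketched roughly as follows. The error-code strings are taken from the constants mentioned in the discussion; the `Action` enum and `route` method are hypothetical illustration, not the driver's actual structure:

```java
// Sketch: the two distinct rename-failure paths described above.
public class RenameFailureRouting {

    enum Action { RETRY_WITH_SOURCE_ETAG, ATTEMPT_CTID_RECOVERY, RETHROW }

    static Action route(String serviceErrorCode) {
        switch (serviceErrorCode) {
            // Rare tracking-metadata race (HADOOP-18242): fetch the source
            // etag and retrigger the rename.
            case "RenameDestinationParentPathNotFound":
                return Action.RETRY_WITH_SOURCE_ETAG;
            // Source vanished: the first attempt may have succeeded server-side,
            // so check the client transaction id on the destination instead.
            case "SourcePathNotFound":
                return Action.ATTEMPT_CTID_RECOVERY;
            default:
                return Action.RETHROW;
        }
    }
}
```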
// ref: HADOOP-18242. Rename failure occurring due to a rare case of
// tracking metadata being in incomplete state.
Makes sense. Please add some code comments to make all this more readable and easily understandable.
In the main Javadoc for both types (etag and client transaction id) of rename, it would be good to explain the whole flow of recovery.
@@ -2289,6 +2289,6 @@ public void answer(final AbfsRestOperation mockedObj,
null, op);
}
}
});
}, 0);
What is this change about?
We are using mockAbfsOperationCreation to simulate HTTP failures. In some cases, we need the first call to fail, and in others, we need the second call to fail, depending on which call is a rename in the flow. This change is necessary to specify which API call should fail.
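The idea of the extra index argument in the `}, 0);` change above can be sketched with a simple counter-based failure injector (the class and method names here are hypothetical, not the test helper's actual API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: fail the Nth call, mirroring the extra index argument added to the
// mock helper so tests can choose which call in the flow is the rename.
public class FailureInjector {
    private final AtomicInteger calls = new AtomicInteger();
    private final int failOnCallIndex;

    FailureInjector(int failOnCallIndex) {
        this.failOnCallIndex = failOnCallIndex;
    }

    /** Returns normally except on the configured zero-based call index. */
    String execute() {
        if (calls.getAndIncrement() == failOnCallIndex) {
            throw new RuntimeException("simulated HTTP failure");
        }
        return "ok";
    }
}
```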
+1
…ansaction id over DFS endpoint (apache#7509). Contributed by Manish Bhatt. Reviewed by Anmol Asrani, Anuj Modi, Manika Joshi. Signed off by: Anuj Modi <[email protected]>
Jira: https://issues.apache.org/jira/browse/HADOOP-19497
Linked PR: #7364
We have implemented create and rename recovery using client transaction IDs over the DFS endpoint (HADOOP-19450 [ABFS] Rename/Create path idempotency client-level resolution - ASF JIRA). Since the backend changes were not fully rolled out, we initially implemented the changes with the flag disabled. With this update, we aim to enable the flag, which will start sending client transaction IDs. In case of a failure, we will attempt to recover from the failed state if possible. Here are the detailed steps and considerations for this process:
Implementation Overview: We introduced a mechanism for create and rename recovery via client transaction IDs to enhance reliability and data integrity over the DFS endpoint. This change was initially flagged as disabled due to incomplete backend rollouts.
Current Update: With the backend changes now in place, we are ready to enable the flag. This will activate the sending of client transaction IDs, allowing us to track and manage transactions more effectively.
Failure Recovery: The primary advantage of enabling this flag is the potential for recovery from failed states. If a transaction fails, we can use the client transaction ID to attempt a recovery, minimizing data loss and ensuring continuity.
Next Steps: We will proceed with enabling the flag and closely monitor the system's performance. Any issues or failures will be documented and addressed promptly to ensure a smooth transition.
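The end-to-end recovery flow described above can be sketched as follows, with a hypothetical `Service` interface standing in for the real ABFS client (the method names and the `renameWithRecovery` helper are illustration, not the driver's API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch: attach a client transaction id to the rename; if the request fails
// ambiguously, ask the service whether that id landed on the destination.
public class CtIdRecoveryFlow {
    static final String X_MS_CLIENT_TRANSACTION_ID = "x-ms-client-transaction-id";

    interface Service {
        void rename(String src, String dst, Map<String, String> headers)
                throws Exception;
        Map<String, String> getPathStatus(String path);
    }

    static boolean renameWithRecovery(Service svc, String src, String dst) {
        String txnId = UUID.randomUUID().toString();
        Map<String, String> headers = new HashMap<>();
        headers.put(X_MS_CLIENT_TRANSACTION_ID, txnId);
        try {
            svc.rename(src, dst, headers);
            return true;
        } catch (Exception ambiguousFailure) {
            // The request may have succeeded before the connection dropped:
            // if the destination echoes our transaction id, treat it as success.
            return txnId.equals(
                svc.getPathStatus(dst).get(X_MS_CLIENT_TRANSACTION_ID));
        }
    }
}
```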